In this project, I will explore how Starbucks customers use the app, how well the current offer system performs, and whom the app should target with promotions. The data sets used in this project contain simulated data that mimics customer behavior on the Starbucks rewards mobile app. From it, we can understand customers' behavior, which may help us make better decisions.
The problem here is that we don't want to send offers to every customer; we want to target only those we believe are likely to complete an offer. Giving an offer to someone who will probably not complete it wastes time and resources that could go to someone who will. I will approach this problem by first cleaning the data, then doing some exploratory analysis to see who the most valuable customers are, and finally building a model to help us predict future customers and which type of offer we should give them.
My objective is to find patterns and show when, and to whom, to give a specific offer. The main users of this kind of application are Starbucks employees and analysts. The plan in this project is to pose questions and answer them with data visualization. The data, provided by Starbucks, contains simulated data that mimics customer behavior.
In this project we were given 3 files. Before starting the analysis we have to explore the data we have. We need to check whether it is clean and whether each column has the type its data implies. For example, if a column called price is stored as strings, we need to convert it to a numeric type: summing a string column concatenates the values instead of returning the column's total. The same goes for dates saved as strings.
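The type fix described above can be sketched with pandas on a made-up frame (the column names `price` and `date` here are hypothetical, not from the Starbucks data):

```python
import pandas as pd

# A toy frame where prices and dates were loaded as strings
df = pd.DataFrame({'price': ['3.50', '4.25', '2.00'],
                   'date': ['20170212', '20180123', '20161216']})

# Summing a string column concatenates instead of adding
concatenated = df['price'].sum()  # '3.504.252.00'

# Convert to proper types before aggregating
df['price'] = pd.to_numeric(df['price'])
df['date'] = pd.to_datetime(df['date'], format='%Y%m%d')

total = df['price'].sum()  # now a real numeric total
```

Only after the conversion does `sum()` return the total we actually want.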
# import libraries
import pandas as pd
import numpy as np
import math
import json
import plotly
import plotly.graph_objs as go
import matplotlib.pyplot as plt
import seaborn as sns
plotly.offline.init_notebook_mode()
%matplotlib inline
from sklearn.preprocessing import MultiLabelBinarizer
import warnings
warnings.filterwarnings('ignore')
# read in the json files
portfolio = pd.read_json('data/portfolio.json', orient='records', lines=True)
profile = pd.read_json('data/profile.json', orient='records', lines=True)
transcript = pd.read_json('data/transcript.json', orient='records', lines=True)
The data we have is provided by Starbucks. Here is a quick breakdown of what it looks like:
portfolio.sample(5)
| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 8 | 5 | [web, email, mobile, social] | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d |
| 5 | 3 | [web, email, mobile, social] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
portfolio.shape
(10, 6)
profile.sample(5)
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 3680 | F | 65 | 44eae499188b4c5c849741ac698aed4b | 20161216 | 97000.0 |
| 11260 | None | 118 | ea796e04a85942e3a77985ed1f9a1940 | 20170420 | NaN |
| 13033 | M | 29 | 17a83b17fa7b4293989b830cfa30546a | 20171121 | 38000.0 |
| 14310 | M | 20 | bd52aa1e72284ae69f476817c9989794 | 20180123 | 39000.0 |
| 15943 | M | 65 | 6b1c9e0cc8a54186a4129131a1726738 | 20180105 | 100000.0 |
profile.shape
(17000, 5)
transcript.sample(5)
| | person | event | value | time |
|---|---|---|---|---|
| 187613 | ce0d0f8ee78448568754d6f9d8c90b1f | transaction | {'amount': 20.41} | 456 |
| 165717 | b2aaf8311ae9422981ded134c2bbbc34 | offer viewed | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 408 |
| 290871 | feca5113ed4f47268bcde4c176c6dd9f | transaction | {'amount': 1.03} | 648 |
| 210912 | 75d9de6a8e66420ba4215e53003e0f42 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 504 |
| 137747 | f277d871eb28447bb86e4b73a9bc4379 | transaction | {'amount': 30.4} | 366 |
transcript.shape
(306534, 4)
In this part, I will go through the data, do some data wrangling, and fix issues with columns such as value in the transcript table and channels in the portfolio, among others.
First, let's check the data types.
portfolio
| | reward | channels | difficulty | duration | offer_type | id |
|---|---|---|---|---|---|---|
| 0 | 10 | [email, mobile, social] | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd |
| 1 | 10 | [web, email, mobile, social] | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 |
| 2 | 0 | [web, email, mobile] | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed |
| 3 | 5 | [web, email, mobile] | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 |
| 4 | 5 | [web, email] | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 |
| 5 | 3 | [web, email, mobile, social] | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 |
| 6 | 2 | [web, email, mobile, social] | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 |
| 7 | 0 | [email, mobile, social] | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 |
| 8 | 5 | [web, email, mobile, social] | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d |
| 9 | 2 | [web, email, mobile] | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 |
portfolio.dtypes
reward         int64
channels      object
difficulty     int64
duration       int64
offer_type    object
id            object
dtype: object
One issue with the portfolio dataframe is the list of items in the channels column. To fix it I will do something similar to one-hot encoding: create a new column for each channel, with 1 when that channel applies to the promotion and 0 when it doesn't.
mlb = MultiLabelBinarizer()
res = pd.DataFrame(mlb.fit_transform(portfolio['channels']),
columns=mlb.classes_,
index=portfolio['channels'].index)
portfolio = portfolio.drop('channels', axis=1)
portfolio = pd.concat([portfolio, res], axis=1, sort=False)
portfolio
| | reward | difficulty | duration | offer_type | id | email | mobile | social | web |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 10 | 7 | bogo | ae264e3637204a6fb9bb56bc8210ddfd | 1 | 1 | 1 | 0 |
| 1 | 10 | 10 | 5 | bogo | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 1 | 1 | 1 | 1 |
| 2 | 0 | 0 | 4 | informational | 3f207df678b143eea3cee63160fa8bed | 1 | 1 | 0 | 1 |
| 3 | 5 | 5 | 7 | bogo | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 1 | 1 | 0 | 1 |
| 4 | 5 | 20 | 10 | discount | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 1 | 0 | 0 | 1 |
| 5 | 3 | 7 | 7 | discount | 2298d6c36e964ae4a3e7e9706d1fb8c2 | 1 | 1 | 1 | 1 |
| 6 | 2 | 10 | 10 | discount | fafdcd668e3743c1bb461111dcafc2a4 | 1 | 1 | 1 | 1 |
| 7 | 0 | 0 | 3 | informational | 5a8bc65990b245e5a138643cd4eb9837 | 1 | 1 | 1 | 0 |
| 8 | 5 | 5 | 5 | bogo | f19421c1d4aa40978ebb69ca19b0e20d | 1 | 1 | 1 | 1 |
| 9 | 2 | 10 | 7 | discount | 2906b810c7d4411798c6938adc9daaa5 | 1 | 1 | 0 | 1 |
For the portfolio dataframe, we can see that we don't have any NaN values.
profile.head()
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | None | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | NaN |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.0 |
| 2 | None | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | NaN |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.0 |
| 4 | None | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | NaN |
profile.dtypes
gender               object
age                   int64
id                   object
became_member_on      int64
income              float64
dtype: object
First, we can notice that there are some NaN values in gender and income. Let's count the NaNs in all columns.
profile.isna().sum()
gender              2175
age                    0
id                     0
became_member_on       0
income              2175
dtype: int64
As we can see, the gender and income columns have NaN values. For gender I will fill NaNs with 'NA', and for income I will fill NaNs with the mean.
profile['gender'].fillna('NA', inplace=True)
profile['income'].fillna((profile['income'].mean()), inplace=True)
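As an aside, mean imputation is sensitive to skewed incomes; a median-based variant (a sketch of an alternative on made-up numbers, not what this analysis uses) would look like:

```python
import numpy as np
import pandas as pd

# Toy income column with missing values (hypothetical numbers)
income = pd.Series([30000.0, 50000.0, np.nan, 120000.0, np.nan])

# The median (50000) is pulled up less by the single high income
# than the mean (~66667), so it is more robust to outliers
filled = income.fillna(income.median())
```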
While working through the data I noticed the age 118, which seems impossible. Looking closer, every profile with age 118 also has no gender listed, so it was probably entered wrong or is a default value. I will keep those 118 values as they are.
profile.isna().sum()
gender              0
age                 0
id                  0
became_member_on    0
income              0
dtype: int64
profile.dtypes
gender               object
age                   int64
id                   object
became_member_on      int64
income              float64
dtype: object
Lastly, let's check the transcript dataframe.
transcript.head()
| | person | event | value | time |
|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | {'offer id': '9b98b8c7a33c4b65b9aebfe6a799e6d9'} | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | {'offer id': '0b1e1539f2cc45b7b9fa7c272da2e1d7'} | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | {'offer id': '2906b810c7d4411798c6938adc9daaa5'} | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | {'offer id': 'fafdcd668e3743c1bb461111dcafc2a4'} | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | {'offer id': '4d5c57ea9a6940dd891ad53e9dbe8da0'} | 0 |
transcript.dtypes
person    object
event     object
value     object
time       int64
dtype: object
Checking for NaN values, we can see that there are none. We still have to clean the value column, since it holds a dictionary of offer id, amount, or reward.
transcript.isna().sum()
person    0
event     0
value     0
time      0
dtype: int64
# find the different keys in the value column
keys = []
for idx, row in transcript.iterrows():
    for k in row['value']:
        if k not in keys:
            keys.append(k)
keys
['offer id', 'amount', 'offer_id', 'reward']
As we can see above, the keys include both offer_id and offer id. After extracting them we need to merge these two into a single column, since they are the same thing.
# Iterate over the transcript table, unpack the value column, and put each key in a separate column.
transcript['offer_id'] = ''
transcript['amount'] = 0
transcript['reward'] = 0
for idx, row in transcript.iterrows():
    for k in row['value']:
        if k == 'offer_id' or k == 'offer id':
            transcript.at[idx, 'offer_id'] = row['value'][k]
        if k == 'amount':
            transcript.at[idx, 'amount'] = row['value'][k]
        if k == 'reward':
            transcript.at[idx, 'reward'] = row['value'][k]
# example
transcript.loc[306506]
person b895c57e8cd047a8872ce02aa54759d6
event offer completed
value {'offer_id': 'fafdcd668e3743c1bb461111dcafc2a4...
time 714
offer_id fafdcd668e3743c1bb461111dcafc2a4
amount 0
reward 2
Name: 306506, dtype: object
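The row-by-row loop above works but is slow on 300k rows; the same unpacking can be done in a vectorized way with `pd.json_normalize` (a sketch on a toy frame, assuming the `value` column holds dicts as in the raw data):

```python
import pandas as pd

# Toy transcript with the same mixed keys seen in the real data
toy = pd.DataFrame({'value': [{'offer id': 'abc'},
                              {'amount': 20.41},
                              {'offer_id': 'xyz', 'reward': 2}]})

# Expand each dict into its own columns in one pass
expanded = pd.json_normalize(toy['value'].tolist())

# Merge the two spellings of the offer-id key into one column
expanded['offer_id'] = expanded['offer_id'].fillna(expanded['offer id'])
expanded = expanded.drop(columns=['offer id'])

result = pd.concat([toy.drop(columns=['value']), expanded], axis=1)
```

Unlike the loop, this leaves NaN (rather than '' or 0) where a key is absent, so a `fillna` pass would be needed to match the columns built above.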
Now there is no need to keep the value column, so we can drop it.
transcript = transcript.drop('value', axis=1)
transcript.head()
| | person | event | time | offer_id | amount | reward |
|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0 | 0 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 0 | 0 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | 0 | 2906b810c7d4411798c6938adc9daaa5 | 0 | 0 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | 0 | fafdcd668e3743c1bb461111dcafc2a4 | 0 | 0 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | 0 | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 0 | 0 |
Now, the data looks ready for some analysis and machine learning.
This work is divided into two parts, analysis and modeling. I will start with the analysis, where I will explore the data and try to find insights.
1- How many variables do I have in each dataframe? What are their types?
2- What are the most common values for each column in each dataframe?
3- What is the average income for Starbucks customers?
4- What is the average age for Starbucks customers?
5- What is the most common promotion?
6- What is the least common promotion?
7- Who are the most loyal customers (most transcripts)?
8- What are the most common events in our transcripts?
1- What is the most common promotion for children, teens, adults, and elderly customers?
2- From the profiles, who earns more, males or females?
3- What is the gender distribution in the transcript dataframe?
4- Who takes the longest to reach each promotion goal, and of which gender, age, and income?
5- How many customers did we get each month (became_member_on)?
6- What is the average time between two transcripts for the same customer?
7- Which type of promotion does each gender like (offer_type)?
8- Of the offers received by each customer, how many did they complete?
We already looked at the shapes, checked for NaNs, and viewed the type of each column. For the numerical and categorical data, let's see how often each value appears, the mean, the min/max, etc.
profile.head(5)
| | gender | age | id | became_member_on | income |
|---|---|---|---|---|---|
| 0 | NA | 118 | 68be06ca386d4c31939f3a4f0e3dd783 | 20170212 | 65404.991568 |
| 1 | F | 55 | 0610b486422d4921ae7d2bf64640c50b | 20170715 | 112000.000000 |
| 2 | NA | 118 | 38fe809add3b4fcf9315a9694bb96ff5 | 20180712 | 65404.991568 |
| 3 | F | 75 | 78afa995795e4d85b5d9ceeca43f5fef | 20170509 | 100000.000000 |
| 4 | NA | 118 | a03223e636434f42ac4c3df47e8bac43 | 20170804 | 65404.991568 |
profile.describe()
| | age | became_member_on | income |
|---|---|---|---|
| count | 17000.000000 | 1.700000e+04 | 17000.000000 |
| mean | 62.531412 | 2.016703e+07 | 65404.991568 |
| std | 26.738580 | 1.167750e+04 | 20169.288288 |
| min | 18.000000 | 2.013073e+07 | 30000.000000 |
| 25% | 45.000000 | 2.016053e+07 | 51000.000000 |
| 50% | 58.000000 | 2.017080e+07 | 65404.991568 |
| 75% | 73.000000 | 2.017123e+07 | 76000.000000 |
| max | 118.000000 | 2.018073e+07 | 120000.000000 |
plt.figure(figsize=(13, 4))
sns.boxplot(profile['age'])
plt.title('Age Boxplot')
plt.xlabel('Age')
plt.xticks(rotation = 90)
plt.show();
From the above box plot, we can see that most ages in our profile dataframe fall between 40 and 80. We already noticed one outlier, 118. The median is around 58 years old.
What about income?
plt.figure(figsize=(16, 4))
sns.boxplot(profile['income'])
plt.title('Income Boxplot')
plt.xlabel('income')
plt.xticks(rotation = 90)
plt.show();
Our boxplot shows that the median income is around 65k and most incomes fall between 50k and 78k.
For age, we have 85 different values, so to make the graph more readable we should divide the ages into groups:
profile['age'].value_counts().shape[0]
85
# reference: https://twitter.com/justmarkham/status/1146040449678925824
profile['age_groups'] = pd.cut(profile.age, bins=[0, 12, 18, 21, 64, 200], labels=['child', 'teen', 'young adult', 'adult', 'elderly'])
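To see how `pd.cut` assigns these labels, here is a quick check on a few example ages (the ages themselves are made up):

```python
import pandas as pd

ages = pd.Series([10, 15, 20, 45, 118])

# Same bins and labels as used on the profile dataframe;
# bins are right-inclusive, so 12 -> child and 18 -> teen
groups = pd.cut(ages, bins=[0, 12, 18, 21, 64, 200],
                labels=['child', 'teen', 'young adult', 'adult', 'elderly'])

labels = groups.tolist()
```

Note that the outlier age 118 falls into 'elderly', which is one reason to treat that group with care.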
# I am ignoring the outlier age 118
plt.figure(figsize=(16, 4))
top10_ages = profile['age'].value_counts()[1:].head(10).reset_index()
plt.bar(top10_ages['index'], top10_ages['age'])
plt.title('Common Ages in Profiles')
plt.ylabel('Number of Profiles')
plt.xlabel('Age')
plt.xticks(top10_ages['index'], rotation = 0)
plt.show();
sns.countplot(x='age_groups', data=profile)
plt.title('Number of Profiles In Each Age Group')
plt.ylabel('Number of Profiles')
plt.xlabel('Age Group')
plt.xticks(rotation = 45)
plt.show();
sns.countplot(profile['gender'])
plt.title('Genders in Profiles')
plt.ylabel('Total')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.show();
We can see that we have a lot of profiles in the adult age group (ages between 21 and 64). Let's look inside the adult group and see how it is distributed.
adults = profile[profile['age_groups'] == 'adult']
plt.figure(figsize=(16, 6))
sns.countplot(adults['age'])
plt.title('Number of Profiles In Each Age')
plt.ylabel('Number of Profiles')
plt.xlabel('Age')
plt.xticks(rotation = 45)
plt.show();
For the became_member_on column, because we have a lot of distinct dates, I will group it by month.
profile['became_member_on_month'] = profile['became_member_on'].astype(str).str[:6]
plt.figure(figsize=(16, 6))
sns.countplot(profile['became_member_on_month'])
plt.title('Number of Profiles In Each Month')
plt.ylabel('Number of Profiles')
plt.xlabel('Year-Month')
plt.xticks(rotation = 90)
plt.show();
We can see that we did a better job between August 2017 and January 2018, when we were able to get more than 800 new profiles monthly.
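As an aside, the integer dates could also be parsed into real datetimes instead of slicing strings, which makes later date arithmetic easier (a sketch on toy values, same column meaning as became_member_on):

```python
import pandas as pd

# Toy membership dates stored as YYYYMMDD integers
became = pd.Series([20170212, 20180123, 20161216])

# Parse into datetimes, then bucket by calendar month
dates = pd.to_datetime(became, format='%Y%m%d')
months = dates.dt.to_period('M').astype(str)
```

The resulting month strings ('2017-02', ...) sort chronologically, which also keeps the countplot's x-axis in order.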
For the transcript dataframe, let's see the common values we have.
sns.countplot(transcript['event'])
plt.title('Number of events In Transcripts')
plt.ylabel('Number of Transcripts')
plt.xlabel('Transcript type')
plt.xticks(rotation = 0)
plt.show();
We can see that most of the transcripts are transactions. Around 75% of the offers received were viewed, and nearly 50% of the viewed offers were completed.
profile['income'].describe()['mean']
65404.99156829799
profile['age'].describe()['mean']
62.53141176470588
Here I will look only at completed promotions.
completed_off_count = transcript[transcript['event'] == 'offer completed']
sns.countplot(y=completed_off_count['offer_id'])
plt.title('Number of Completed Promotion for each Offer')
plt.ylabel('Promotion ID')
plt.xticks(rotation = 45)
plt.show();
print(f'Offer ID: {completed_off_count["offer_id"].value_counts().index[0]}')
print(f'Number of Completion: {completed_off_count["offer_id"].value_counts().values[0]}')
Offer ID: fafdcd668e3743c1bb461111dcafc2a4 Number of Completion: 5317
Now let's look at the most common types of offers. To find that, I need to get the offer type from the portfolio dataframe.
def get_offer_type(offer_id):
    try:
        return portfolio[portfolio['id'] == offer_id]['offer_type'].values[0]
    except IndexError:
        # no matching offer, e.g. plain transactions
        return 'NA'
transcript['offer_type'] = transcript.apply(lambda x: get_offer_type(x['offer_id']), axis=1)
sns.countplot(transcript[transcript['offer_type'] != 'NA']['offer_type'])
plt.title('Number of Transcripts per Offer Type')
plt.ylabel('Number of transactions')
plt.xlabel('Offer Type')
plt.xticks(rotation = 45)
plt.show();
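The per-row lookup in `get_offer_type` scans the whole portfolio for every transcript row; since offer ids are unique, building an id-to-type lookup once and mapping it is much faster. A sketch with toy frames (same column names as above):

```python
import pandas as pd

# Toy versions of the two frames, with only the relevant columns
portfolio_toy = pd.DataFrame({'id': ['a1', 'b2'],
                              'offer_type': ['bogo', 'discount']})
transcript_toy = pd.DataFrame({'offer_id': ['a1', '', 'b2']})

# Build the id -> offer_type lookup once, then map it over the column;
# unmatched ids (like the empty string for transactions) become 'NA'
lookup = portfolio_toy.set_index('id')['offer_type']
transcript_toy['offer_type'] = transcript_toy['offer_id'].map(lookup).fillna('NA')
```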
As we saw in the above graph, it is pretty close between BOGO offers and discount offers.
print(f'Offer ID: {completed_off_count["offer_id"].value_counts().index[-1]}')
print(f'Number of Completion: {completed_off_count["offer_id"].value_counts().values[-1]}')
Offer ID: 4d5c57ea9a6940dd891ad53e9dbe8da0 Number of Completion: 3331
For this one, I will check the offer completed and transaction event types.
loyal_customer_count = transcript[(transcript['event'] == 'offer completed') | (transcript['event'] == 'transaction')].groupby(['person', 'event'])['amount'].sum().reset_index()
loyal_customer_count = loyal_customer_count.sort_values('amount', ascending=False).head(10)
count = 1
print(' ')
print(' *************** [ LEADERBOARD ] **************')
print(' ')
for idx, row in loyal_customer_count.iterrows():
    print(f'.------------------- [ #{count} ] ------------------.')
    print(f'| Profile ID: {row["person"]} |')
    print(f'| Number of Completed Offers: {completed_off_count[(completed_off_count["person"] == row["person"]) & (completed_off_count["event"] == "offer completed")].shape[0]} |')
    print(f'| Amount: ${row["amount"]} |')
    print(f"'----------------------------------------------'")
    count += 1
 *************** [ LEADERBOARD ] **************

.------------------- [ #1 ] ------------------.
| Profile ID: 3c8d541112a74af99e88abbd0692f00e |
| Number of Completed Offers: 5 |
| Amount: $1606 |
'----------------------------------------------'
.------------------- [ #2 ] ------------------.
| Profile ID: f1d65ae63f174b8f80fa063adcaa63b7 |
| Number of Completed Offers: 6 |
| Amount: $1360 |
'----------------------------------------------'
.------------------- [ #3 ] ------------------.
| Profile ID: ae6f43089b674728a50b8727252d3305 |
| Number of Completed Offers: 3 |
| Amount: $1320 |
'----------------------------------------------'
.------------------- [ #4 ] ------------------.
| Profile ID: 626df8678e2a4953b9098246418c9cfa |
| Number of Completed Offers: 4 |
| Amount: $1314 |
'----------------------------------------------'
.------------------- [ #5 ] ------------------.
| Profile ID: 73afdeca19e349b98f09e928644610f8 |
| Number of Completed Offers: 5 |
| Amount: $1314 |
'----------------------------------------------'
.------------------- [ #6 ] ------------------.
| Profile ID: 52959f19113e4241a8cb3bef486c6412 |
| Number of Completed Offers: 5 |
| Amount: $1285 |
'----------------------------------------------'
.------------------- [ #7 ] ------------------.
| Profile ID: ad1f0a409ae642bc9a43f31f56c130fc |
| Number of Completed Offers: 3 |
| Amount: $1256 |
'----------------------------------------------'
.------------------- [ #8 ] ------------------.
| Profile ID: d240308de0ee4cf8bb6072816268582b |
| Number of Completed Offers: 5 |
| Amount: $1244 |
'----------------------------------------------'
.------------------- [ #9 ] ------------------.
| Profile ID: 946fc0d3ecc4492aa4cc06cf6b1492c3 |
| Number of Completed Offers: 4 |
| Amount: $1224 |
'----------------------------------------------'
.------------------- [ #10 ] ------------------.
| Profile ID: 6406abad8e2c4b8584e4f68003de148d |
| Number of Completed Offers: 3 |
| Amount: $1206 |
'----------------------------------------------'
sns.countplot(transcript['event'])
plt.title('Number of events In Transcripts')
plt.ylabel('Number of Transcripts')
plt.xlabel('Transcript type')
plt.xticks(rotation = 0)
plt.show();
Transactions make up the largest share of the transcript dataframe with around 140k rows, almost half of the total.
To find out, we need to get each customer's age group into the transcript dataframe. I created a function that looks it up in the profile dataframe (it takes time to run). I will ignore the 'child' age group since there are no rows for it.
def get_customer_age_group(profile_id):
    return profile[profile['id'] == profile_id]['age_groups'].values[0]

transcript['age_group'] = transcript.apply(lambda x: get_customer_age_group(x['person']), axis=1)
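Both this helper and the gender helper below scan the whole profile frame once per transcript row, which is why they take so long. Since profile ids are unique, a `Series.map` over an indexed lookup gives the same columns far faster; a sketch with toy frames (same column names as the real ones):

```python
import pandas as pd

# Toy profile and transcript frames with the columns the helpers use
profile_toy = pd.DataFrame({'id': ['p1', 'p2'],
                            'age_groups': ['adult', 'elderly'],
                            'gender': ['F', 'M']})
transcript_toy = pd.DataFrame({'person': ['p2', 'p1', 'p2']})

# Build one indexed lookup table, then map each attribute vectorized
by_id = profile_toy.set_index('id')
transcript_toy['age_group'] = transcript_toy['person'].map(by_id['age_groups'])
transcript_toy['gender'] = transcript_toy['person'].map(by_id['gender'])
```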
plt.figure(figsize=(14, 6))
sns.countplot(x="age_group", hue="offer_type", data=transcript)
plt.title('Most Popular Offers to Each Age Group')
plt.ylabel('Total')
plt.xlabel('Age Group')
plt.xticks(rotation = 0)
plt.legend(title='Offer Type')
plt.show();
Here I ignore those who didn't report their gender.
plt.figure(figsize=(14, 6))
sns.violinplot(x=profile[profile['gender'] != 'NA']['gender'], y=profile['income'])
plt.title('Income vs Gender')
plt.ylabel('Income')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.show();
The graph above shows that the median income (the white dot) for females (around 70k) is higher than for males (around 60k). We can also see that female incomes spread from 40k to 100k, while most male incomes fall between 40k and 70k, close to their median.
We also need gender here, so I will write a function to add it (it takes time to run).
def get_customer_gender(profile_id):
    return profile[profile['id'] == profile_id]['gender'].values[0]

transcript['gender'] = transcript.apply(lambda x: get_customer_gender(x['person']), axis=1)
plt.figure(figsize=(14, 6))
sns.countplot(x=transcript[transcript["gender"] != 'NA']['gender'], hue="offer_type", data=transcript)
plt.title('Most Popular Offers to Each Gender')
plt.ylabel('Total')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.legend(title='Offer Type')
plt.show();
plt.figure(figsize=(14, 6))
sns.countplot(x=transcript[transcript["gender"] != 'NA']['gender'], hue="event", data=transcript)
plt.title('Most Popular Offer Event to Each Gender')
plt.ylabel('Total')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.legend(title='Offer Event')
plt.show();
print(f'Number of male records is {transcript[transcript["gender"] == "M"].shape[0]}, and number of female records is {transcript[transcript["gender"] == "F"].shape[0]}.')
Number of male records is 155690, and number of female records is 113101.
From the two graphs above, we can see that males received more offers than females. Both genders seem to respond to those offers similarly: around 75% of the offers received were viewed by both genders, but females completed those offers more often than males. The numbers are:
total_trans_g_o = transcript[transcript["gender"] != 'NA'].groupby(['gender','offer_type']).count()
total_trans_g_e = transcript[transcript["gender"] != 'NA'].groupby(['gender','event']).count()
total_trans_g_o, total_trans_g_e
( person event time offer_id amount reward \
gender offer_type
F NA 49382 49382 49382 49382 49382 49382
bogo 27619 27619 27619 27619 27619 27619
discount 26652 26652 26652 26652 26652 26652
informational 9448 9448 9448 9448 9448 9448
M NA 72794 72794 72794 72794 72794 72794
bogo 35301 35301 35301 35301 35301 35301
discount 34739 34739 34739 34739 34739 34739
informational 12856 12856 12856 12856 12856 12856
O NA 1781 1781 1781 1781 1781 1781
bogo 914 914 914 914 914 914
discount 920 920 920 920 920 920
informational 356 356 356 356 356 356
age_group
gender offer_type
F NA 49382
bogo 27619
discount 26652
informational 9448
M NA 72794
bogo 35301
discount 34739
informational 12856
O NA 1781
bogo 914
discount 920
informational 356 ,
person time offer_id amount reward offer_type \
gender event
F offer completed 15477 15477 15477 15477 15477 15477
offer received 27456 27456 27456 27456 27456 27456
offer viewed 20786 20786 20786 20786 20786 20786
transaction 49382 49382 49382 49382 49382 49382
M offer completed 16466 16466 16466 16466 16466 16466
offer received 38129 38129 38129 38129 38129 38129
offer viewed 28301 28301 28301 28301 28301 28301
transaction 72794 72794 72794 72794 72794 72794
O offer completed 501 501 501 501 501 501
offer received 916 916 916 916 916 916
offer viewed 773 773 773 773 773 773
transaction 1781 1781 1781 1781 1781 1781
age_group
gender event
F offer completed 15477
offer received 27456
offer viewed 20786
transaction 49382
M offer completed 16466
offer received 38129
offer viewed 28301
transaction 72794
O offer completed 501
offer received 916
offer viewed 773
transaction 1781 )
total_trans_go_o_t = total_trans_g_o.loc[('F')]['event'].sum()
total_trans_go_o_tt = total_trans_g_o.loc[('M')]['event'].sum()
total_trans_go_o_t_offers_f = total_trans_g_o.loc[('F')].loc[['bogo', 'discount', 'informational']]['event'].sum()
total_trans_go_o_t_offers_m = total_trans_g_o.loc[('M')].loc[['bogo', 'discount', 'informational']]['event'].sum()
print('For Females:')
print(f'Total transcripts is: {total_trans_go_o_t}.')
print(f"Number of bogo offers: {total_trans_g_o.loc[('F', 'bogo')].values[0]}, {round((total_trans_g_o.loc[('F', 'bogo')].values[0]/total_trans_go_o_t_offers_f)*100,2)}% of total.")
print(f"Number of discount offers: {total_trans_g_o.loc[('F', 'discount')].values[0]}, {round((total_trans_g_o.loc[('F', 'discount')].values[0]/total_trans_go_o_t_offers_f)*100,2)}% of total.")
print(f"Number of informational offers: {total_trans_g_o.loc[('F', 'informational')].values[0]}, {round((total_trans_g_o.loc[('F', 'informational')].values[0]/total_trans_go_o_t_offers_f)*100,2)}% of total.")
print(f"Number of offer completed: {total_trans_g_e.loc[('F', 'offer completed')].values[0]}, {round((total_trans_g_e.loc[('F', 'offer completed')].values[0]/total_trans_g_e.loc[('F', 'offer received')].values[0])*100,2)}% of total offers received.")
print(f"Number of offer received: {total_trans_g_e.loc[('F', 'offer received')].values[0]}, {round((total_trans_g_e.loc[('F', 'offer received')].values[0]/total_trans_go_o_t_offers_f)*100,2)}% of total.")
print(f"Number of offer viewed: {total_trans_g_e.loc[('F', 'offer viewed')].values[0]}, {round((total_trans_g_e.loc[('F', 'offer viewed')].values[0]/total_trans_go_o_t_offers_f)*100,2)}% of total.")
print(f"Number of transaction: {total_trans_g_e.loc[('F', 'transaction')].values[0]}, {round((total_trans_g_e.loc[('F', 'transaction')].values[0]/total_trans_go_o_t)*100,2)}% of total.")
print('\nFor Males:')
print(f'Total transcripts is: {total_trans_go_o_tt}.')
print(f"Number of bogo offers: {total_trans_g_o.loc[('M', 'bogo')].values[0]}, {round((total_trans_g_o.loc[('M', 'bogo')].values[0]/total_trans_go_o_t_offers_m)*100,2)}% of total.")
print(f"Number of discount offers: {total_trans_g_o.loc[('M', 'discount')].values[0]}, {round((total_trans_g_o.loc[('M', 'discount')].values[0]/total_trans_go_o_t_offers_m)*100,2)}% of total.")
print(f"Number of informational offers: {total_trans_g_o.loc[('M', 'informational')].values[0]}, {round((total_trans_g_o.loc[('M', 'informational')].values[0]/total_trans_go_o_t_offers_m)*100,2)}% of total.")
print(f"Number of offer completed: {total_trans_g_e.loc[('M', 'offer completed')].values[0]}, {round((total_trans_g_e.loc[('M', 'offer completed')].values[0]/total_trans_g_e.loc[('M', 'offer received')].values[0])*100,2)}% of total offers received.")
print(f"Number of offer received: {total_trans_g_e.loc[('M', 'offer received')].values[0]}, {round((total_trans_g_e.loc[('M', 'offer received')].values[0]/total_trans_go_o_t_offers_m)*100,2)}% of total.")
print(f"Number of offer viewed: {total_trans_g_e.loc[('M', 'offer viewed')].values[0]}, {round((total_trans_g_e.loc[('M', 'offer viewed')].values[0]/total_trans_go_o_t_offers_m)*100,2)}% of total.")
print(f"Number of transaction: {total_trans_g_e.loc[('M', 'transaction')].values[0]}, {round((total_trans_g_e.loc[('M', 'transaction')].values[0]/total_trans_go_o_t)*100,2)}% of total.")
For Females:
Total transcripts is: 113101.
Number of bogo offers: 27619, 43.34% of total.
Number of discount offers: 26652, 41.83% of total.
Number of informational offers: 9448, 14.83% of total.
Number of offer completed: 15477, 56.37% of total offers received.
Number of offer received: 27456, 43.09% of total.
Number of offer viewed: 20786, 32.62% of total.
Number of transaction: 49382, 43.66% of total.

For Males:
Total transcripts is: 155690.
Number of bogo offers: 35301, 42.58% of total.
Number of discount offers: 34739, 41.91% of total.
Number of informational offers: 12856, 15.51% of total.
Number of offer completed: 16466, 43.18% of total offers received.
Number of offer received: 38129, 46.0% of total.
Number of offer viewed: 28301, 34.14% of total.
Number of transaction: 72794, 64.36% of total.
The numbers above show that males received more offers than females (38,129 vs 27,456) and also made more transactions (72,794 vs 49,382), so males both receive and transact more. Regarding offer types, males and females received BOGO and discount offers in roughly the same proportions (about 42-43% each).
They are pretty close, if not identical: both males and females take about 16 days to complete an offer.
tran_avg_len_g = transcript.groupby(['gender', 'offer_id'])['time'].mean().reset_index()
tran_avg_len_g = tran_avg_len_g[tran_avg_len_g['offer_id'] == '']
print(tran_avg_len_g[tran_avg_len_g['gender'] == 'F']['time'].values[0], tran_avg_len_g[tran_avg_len_g['gender'] == 'F']['time'].values[0] / 24)
print(tran_avg_len_g[tran_avg_len_g['gender'] == 'M']['time'].values[0], tran_avg_len_g[tran_avg_len_g['gender'] == 'M']['time'].values[0] / 24)
380.8600299704346 15.869167915434774 381.72731269060637 15.905304695441933
The mean time it takes a customer to complete an offer is around 16 days (390 hours).
tran_avg_len = transcript.groupby(['person', 'offer_id'])['time'].mean().reset_index()
tran_avg_len = tran_avg_len[tran_avg_len['offer_id'] == '']
tran_avg_len['time'].mean(), tran_avg_len['time'].mean() / 24
(390.0493165677239, 16.252054856988497)
We can see that both genders like bogo and discount offers and react the same way to informational offers: neither seems very interested in them.
plt.figure(figsize=(14, 6))
sns.countplot(x=transcript[transcript["gender"] != 'NA']['gender'], hue="offer_type", data=transcript)
plt.title('Most Popular Offers to Each Gender')
plt.ylabel('Total')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.legend(title='Offer Type')
plt.show();
plt.figure(figsize=(14, 6))
sns.countplot(x=transcript[transcript["gender"] != 'NA']['gender'], hue="event", data=transcript)
plt.title('Most Popular Offer Event to Each Gender')
plt.ylabel('Total')
plt.xlabel('Gender')
plt.xticks(rotation = 0)
plt.legend(title='Offer Event')
plt.show();
print('For Females:')
print(f"Number of offer completed: {total_trans_g_e.loc[('F', 'offer completed')].values[0]}, {round((total_trans_g_e.loc[('F', 'offer completed')].values[0]/total_trans_g_e.loc[('F', 'offer received')].values[0])*100,2)}% of total offers received.")
print(f"Number of offer received: {total_trans_g_e.loc[('F', 'offer received')].values[0]}, {round((total_trans_g_e.loc[('F', 'offer received')].values[0]/total_trans_go_o_t_offers_f)*100,2)}% of total.")
print(f"Number of offer viewed: {total_trans_g_e.loc[('F', 'offer viewed')].values[0]}, {round((total_trans_g_e.loc[('F', 'offer viewed')].values[0]/total_trans_go_o_t_offers_f)*100,2)}% of total.")
print(f"Number of transaction: {total_trans_g_e.loc[('F', 'transaction')].values[0]}, {round((total_trans_g_e.loc[('F', 'transaction')].values[0]/total_trans_go_o_t)*100,2)}% of total.")
print('\nFor Males:')
print(f"Number of offer completed: {total_trans_g_e.loc[('M', 'offer completed')].values[0]}, {round((total_trans_g_e.loc[('M', 'offer completed')].values[0]/total_trans_g_e.loc[('M', 'offer received')].values[0])*100,2)}% of total offers received.")
print(f"Number of offer received: {total_trans_g_e.loc[('M', 'offer received')].values[0]}, {round((total_trans_g_e.loc[('M', 'offer received')].values[0]/total_trans_go_o_t_offers_m)*100,2)}% of total.")
print(f"Number of offer viewed: {total_trans_g_e.loc[('M', 'offer viewed')].values[0]}, {round((total_trans_g_e.loc[('M', 'offer viewed')].values[0]/total_trans_go_o_t_offers_m)*100,2)}% of total.")
print(f"Number of transaction: {total_trans_g_e.loc[('M', 'transaction')].values[0]}, {round((total_trans_g_e.loc[('M', 'transaction')].values[0]/total_trans_go_o_t)*100,2)}% of total.")
For Females:
Number of offer completed: 15477, 56.37% of total offers received.
Number of offer received: 27456, 43.09% of total.
Number of offer viewed: 20786, 32.62% of total.
Number of transaction: 49382, 43.66% of total.

For Males:
Number of offer completed: 16466, 43.18% of total offers received.
Number of offer received: 38129, 46.0% of total.
Number of offer viewed: 28301, 34.14% of total.
Number of transaction: 72794, 64.36% of total.
Females completed 56% of the offers they received, about 13 percentage points more than males, but males made more transactions than females: 64% vs 43%.
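The per-gender event counts printed above can also be computed compactly with a pivot table. A minimal sketch on toy data (not the real transcript; only the `gender` and `event` column names match this project):

```python
import pandas as pd

# Toy stand-in for the transcript dataframe
transcript = pd.DataFrame({
    'gender': ['F', 'F', 'F', 'M', 'M', 'M', 'M'],
    'event': ['offer received', 'offer completed', 'transaction',
              'offer received', 'offer received', 'offer completed', 'transaction'],
})

# Count events per gender in one call
counts = transcript.pivot_table(index='gender', columns='event',
                                aggfunc='size', fill_value=0)

# Completion rate = completed offers / received offers, per gender
completion_rate = counts['offer completed'] / counts['offer received']
print(completion_rate)  # F -> 1.0, M -> 0.5 on this toy data
```

This avoids writing one `print` statement per gender/event pair.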
Since I have made a lot of changes to the transcript dataframe, and redoing them on every run takes time, I'll save the dataframe to a CSV file and simply load it whenever I come back to work on the project. I will also add one last column to the dataframe, income, which might help us in the model (it takes time to run).
def get_customer_income(profile_id):
    # Look up a customer's income in the profile dataframe
    income = profile[profile['id'] == profile_id]['income'].values[0]
    return income

# Row-wise lookup over the whole transcript; this is the slow step
transcript['income'] = transcript.apply(lambda x: get_customer_income(x['person']), axis=1)
transcript.to_csv('transcript_updated.csv')
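For reference, a vectorized lookup with `Series.map` is usually far faster than the row-wise `apply` above. A sketch on toy data (hypothetical ids; the `id`, `income` and `person` column names match the real dataframes):

```python
import pandas as pd

# Toy stand-ins for the profile and transcript dataframes
profile = pd.DataFrame({'id': ['a', 'b'], 'income': [50000.0, 72000.0]})
transcript = pd.DataFrame({'person': ['a', 'b', 'a']})

# Build an id -> income lookup once, then map every row in a single pass
transcript['income'] = transcript['person'].map(profile.set_index('id')['income'])
print(transcript['income'].tolist())  # [50000.0, 72000.0, 50000.0]
```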
transcript_up = pd.read_csv('transcript_updated.csv').iloc[:, 1:]
transcript_up.isna().sum()
person             0
event              0
time               0
offer_id      138953
amount             0
reward             0
offer_type    138953
age_group          0
gender         33772
income             0
dtype: int64
Before working on the model, I will take care of the NaN values in gender, offer_type and offer_id. offer_id and offer_type are NaN on transaction records, which are not offer received, viewed or completed events; that is why both columns have the same number of NaNs. I will replace the NaNs in both columns with 'NA'. Similarly for gender: some customers didn't state their gender, as we saw before in the profile dataframe, where we also replaced it with 'NA'.
fill_na = ['offer_id', 'offer_type', 'gender']
for i in fill_na:
    transcript_up[i] = transcript_up[i].fillna('NA')
transcript_up.isna().sum()
person        0
event         0
time          0
offer_id      0
amount        0
reward        0
offer_type    0
age_group     0
gender        0
income        0
dtype: int64
In this part, I will try to build a model that can identify which kind of offer we should give a customer. First, let's take a quick look at our final dataframe before modeling.
transcript_up.head(5)
| person | event | time | offer_id | amount | reward | offer_type | age_group | gender | income | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | offer received | 0 | 9b98b8c7a33c4b65b9aebfe6a799e6d9 | 0 | 0 | bogo | elderly | F | 100000.000000 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | offer received | 0 | 0b1e1539f2cc45b7b9fa7c272da2e1d7 | 0 | 0 | discount | elderly | NA | 65404.991568 |
| 2 | e2127556f4f64592b11af22de27a7932 | offer received | 0 | 2906b810c7d4411798c6938adc9daaa5 | 0 | 0 | discount | elderly | M | 70000.000000 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | offer received | 0 | fafdcd668e3743c1bb461111dcafc2a4 | 0 | 0 | discount | elderly | NA | 65404.991568 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | offer received | 0 | 4d5c57ea9a6940dd891ad53e9dbe8da0 | 0 | 0 | bogo | elderly | NA | 65404.991568 |
transcript_up.dtypes
person         object
event          object
time            int64
offer_id       object
amount          int64
reward          int64
offer_type     object
age_group      object
gender         object
income        float64
dtype: object
Because my model will predict offer_type, I will keep only the transcripts that have offer IDs.
transcript_up = transcript_up[transcript_up['offer_id'] != 'NA']
Now, we should split our dataframe into features and a target. Our features here are: event, time, offer_id, amount, reward, age_group, gender and income.
And my target is offer_type. For my target, I will replace the text labels with numbers, where BOGO = 1, discount = 2, informational = 3.
# reference: https://www.datacamp.com/community/tutorials/categorical-data
labels_event = transcript_up['event'].astype('category').cat.categories.tolist()
replace_map_comp_event = {'event' : {k: v for k,v in zip(labels_event,list(range(1,len(labels_event)+1)))}}
print(replace_map_comp_event)
labels_offer_id = transcript_up['offer_id'].astype('category').cat.categories.tolist()
replace_map_comp_offer_id = {'offer_id' : {k: v for k,v in zip(labels_offer_id,list(range(1,len(labels_offer_id)+1)))}}
print(replace_map_comp_offer_id)
labels_age_group = transcript_up['age_group'].astype('category').cat.categories.tolist()
replace_map_comp_age_group = {'age_group' : {k: v for k,v in zip(labels_age_group,list(range(1,len(labels_age_group)+1)))}}
print(replace_map_comp_age_group)
labels_gender = transcript_up['gender'].astype('category').cat.categories.tolist()
replace_map_comp_gender = {'gender' : {k: v for k,v in zip(labels_gender,list(range(1,len(labels_gender)+1)))}}
print(replace_map_comp_gender)
{'event': {'offer completed': 1, 'offer received': 2, 'offer viewed': 3}}
{'offer_id': {'0b1e1539f2cc45b7b9fa7c272da2e1d7': 1, '2298d6c36e964ae4a3e7e9706d1fb8c2': 2, '2906b810c7d4411798c6938adc9daaa5': 3, '3f207df678b143eea3cee63160fa8bed': 4, '4d5c57ea9a6940dd891ad53e9dbe8da0': 5, '5a8bc65990b245e5a138643cd4eb9837': 6, '9b98b8c7a33c4b65b9aebfe6a799e6d9': 7, 'ae264e3637204a6fb9bb56bc8210ddfd': 8, 'f19421c1d4aa40978ebb69ca19b0e20d': 9, 'fafdcd668e3743c1bb461111dcafc2a4': 10}}
{'age_group': {'adult': 1, 'elderly': 2, 'teen': 3, 'young adult': 4}}
{'gender': {'F': 1, 'M': 2, 'NA': 3, 'O': 4}}
# reference: https://www.datacamp.com/community/tutorials/categorical-data
labels_offer_type = transcript_up['offer_type'].astype('category').cat.categories.tolist()
replace_map_comp_offer_type = {'offer_type' : {k: v for k,v in zip(labels_offer_type,list(range(1,len(labels_offer_type)+1)))}}
print(replace_map_comp_offer_type)
{'offer_type': {'bogo': 1, 'discount': 2, 'informational': 3}}
# replace categorical with numerical
transcript_up.replace(replace_map_comp_event, inplace=True)
transcript_up.replace(replace_map_comp_offer_id, inplace=True)
transcript_up.replace(replace_map_comp_age_group, inplace=True)
transcript_up.replace(replace_map_comp_gender, inplace=True)
transcript_up.replace(replace_map_comp_offer_type, inplace=True)
transcript_up.head()
| person | event | time | offer_id | amount | reward | offer_type | age_group | gender | income | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 78afa995795e4d85b5d9ceeca43f5fef | 2 | 0 | 7 | 0 | 0 | 1 | 2 | 1 | 100000.000000 |
| 1 | a03223e636434f42ac4c3df47e8bac43 | 2 | 0 | 1 | 0 | 0 | 2 | 2 | 3 | 65404.991568 |
| 2 | e2127556f4f64592b11af22de27a7932 | 2 | 0 | 3 | 0 | 0 | 2 | 2 | 2 | 70000.000000 |
| 3 | 8ec6ce2a7e7949b1bf142def7d0e0586 | 2 | 0 | 10 | 0 | 0 | 2 | 2 | 3 | 65404.991568 |
| 4 | 68617ca6246f4fbc85e91a2a49552598 | 2 | 0 | 5 | 0 | 0 | 1 | 2 | 3 | 65404.991568 |
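As an aside, the repeated replace-map pattern above can be reproduced with one line per column using pandas' `cat.codes`, which assigns 0-based codes in sorted category order; adding 1 gives the same 1-based encoding. A sketch on the offer_type labels:

```python
import pandas as pd

# Toy column with the three offer types
df = pd.DataFrame({'offer_type': ['bogo', 'discount', 'informational', 'bogo']})

# Categories sort alphabetically (bogo < discount < informational),
# so cat.codes + 1 reproduces bogo=1, discount=2, informational=3
df['offer_type'] = df['offer_type'].astype('category').cat.codes + 1
print(df['offer_type'].tolist())  # [1, 2, 3, 1]
```

One caveat: `cat.codes` depends on the sorted order of the labels actually present, so the explicit dictionaries above are safer when the mapping must stay fixed.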
# Split the data into features and target label
target = transcript_up['offer_type']
features = transcript_up.drop(['person', 'offer_type'], axis = 1)
features.head()
| event | time | offer_id | amount | reward | age_group | gender | income | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0 | 7 | 0 | 0 | 2 | 1 | 100000.000000 |
| 1 | 2 | 0 | 1 | 0 | 0 | 2 | 3 | 65404.991568 |
| 2 | 2 | 0 | 3 | 0 | 0 | 2 | 2 | 70000.000000 |
| 3 | 2 | 0 | 10 | 0 | 0 | 2 | 3 | 65404.991568 |
| 4 | 2 | 0 | 5 | 0 | 0 | 2 | 3 | 65404.991568 |
target.head()
0    1
1    2
2    2
3    2
4    1
Name: offer_type, dtype: int64
We should normalize the numerical columns (time, amount, reward, income) before using them as features, so that no feature dominates just because of its scale.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
to_normalize = ['time', 'amount', 'reward', 'income']
features[to_normalize] = scaler.fit_transform(features[to_normalize])
features.head()
| event | time | offer_id | amount | reward | age_group | gender | income | |
|---|---|---|---|---|---|---|---|---|
| 0 | 2 | 0.0 | 7 | 0.0 | 0.0 | 2 | 1 | 0.777778 |
| 1 | 2 | 0.0 | 1 | 0.0 | 0.0 | 2 | 3 | 0.393389 |
| 2 | 2 | 0.0 | 3 | 0.0 | 0.0 | 2 | 2 | 0.444444 |
| 3 | 2 | 0.0 | 10 | 0.0 | 0.0 | 2 | 3 | 0.393389 |
| 4 | 2 | 0.0 | 5 | 0.0 | 0.0 | 2 | 3 | 0.393389 |
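For intuition, MinMaxScaler rescales each column to [0, 1] via (x - min) / (max - min); a tiny sketch:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# One column with min 10 and max 30
X = np.array([[10.0], [20.0], [30.0]])
scaled = MinMaxScaler().fit_transform(X)
print(scaled.ravel().tolist())  # [0.0, 0.5, 1.0]
```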
from sklearn.model_selection import train_test_split, GridSearchCV
X_train, X_test, y_train, y_test = train_test_split(features, target, random_state=0)
print('Training Features Shape:', X_train.shape)
print('Training Labels Shape:', y_train.shape)
print('Testing Features Shape:', X_test.shape)
print('Testing Labels Shape:', y_test.shape)
Training Features Shape: (125685, 8)
Training Labels Shape: (125685,)
Testing Features Shape: (41896, 8)
Testing Labels Shape: (41896,)
Since this is a straightforward classification problem, I will use accuracy to evaluate my models: the number of correct predictions divided by the total number of predictions.
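The standard way to compute that metric is scikit-learn's `accuracy_score`; a quick sketch on toy labels:

```python
from sklearn.metrics import accuracy_score

y_true = [1, 2, 3, 1, 2]  # actual offer types
y_pred = [1, 2, 2, 1, 2]  # predictions with one mistake
acc = accuracy_score(y_true, y_pred)  # 4 correct out of 5
print(round(acc * 100, 2))  # 80.0
```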
I'll try different models to pick the best out of them.
1. LogisticRegression
# reference: https://towardsdatascience.com/random-forest-in-python-24d0893d51c0
# Note: this scores predictions by treating the class labels (1, 2, 3) as
# numeric values and computing 100 minus the mean absolute percentage error
# (MAPE). It is only a rough proxy for classification accuracy;
# sklearn.metrics.accuracy_score is the standard metric.
def pred_score(model):
    pred = model.predict(X_test)
    # Calculate the absolute errors
    errors = abs(pred - y_test)
    # Calculate mean absolute percentage error
    mape = 100 * (errors / y_test)
    accuracy = 100 - np.mean(mape)
    return round(accuracy, 2)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print(f'Accuracy of Logistic regression classifier on training set: {round(logreg.score(X_train, y_train)*100,2)}%.')
print(f'Prediction Accuracy: {pred_score(logreg)}%')
Accuracy of Logistic regression classifier on training set: 64.32%.
Prediction Accuracy: 79.73%
2. K-Nearest Neighbors
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier()
knn.fit(X_train, y_train)
print(f'Accuracy of K-NN classifier on training set: {round(knn.score(X_train, y_train)*100,2)}%.')
print(f'Prediction Accuracy: {pred_score(knn)}%')
Accuracy of K-NN classifier on training set: 100.0%.
Prediction Accuracy: 100.0%
3. Decision Tree
from sklearn.tree import DecisionTreeClassifier
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
print(f'Accuracy of Decision Tree classifier on training set: {round(dt.score(X_train, y_train)*100,2)}%.')
print(f'Prediction Accuracy: {pred_score(dt)}%')
Accuracy of Decision Tree classifier on training set: 100.0%.
Prediction Accuracy: 100.0%
4. Support Vector Machine
from sklearn.svm import SVC
svm = SVC()
svm.fit(X_train, y_train)
print(f'Accuracy of SVM classifier on training set: {round(svm.score(X_train, y_train)*100,2)}%.')
print(f'Prediction Accuracy: {pred_score(svm)}%')
Accuracy of SVM classifier on training set: 91.46%.
Prediction Accuracy: 94.31%
5. Naive Bayes
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X_train, y_train)
print(f'Accuracy of Naive Bayes classifier on training set: {round(gnb.score(X_train, y_train)*100,2)}%.')
print(f'Prediction Accuracy: {pred_score(gnb)}%')
Accuracy of Naive Bayes classifier on training set: 48.5%.
Prediction Accuracy: 25.06%
6. Random Forest
from sklearn.ensemble import RandomForestRegressor
# Note: this is a regressor applied to class labels; for this classification
# task RandomForestClassifier would be the appropriate estimator, and
# .score() on a regressor returns R^2 rather than accuracy.
rf = RandomForestRegressor(n_estimators = 100, random_state = 42)
rf.fit(X_train, y_train)
print(f'Accuracy of Random Forest classifier on training set: {round(rf.score(X_train, y_train)*100,2)}%.')
print(f'Prediction Accuracy: {pred_score(rf)}%')
Accuracy of Random Forest classifier on training set: 100.0%.
Prediction Accuracy: 100.0%
# reference: https://stackoverflow.com/a/52768022
models = [logreg, knn, dt, svm, gnb, rf]
model_names = [type(n).__name__ for n in models]
tr_accuracy = [x.score(X_train, y_train)*100 for x in models]
pred_accuracy = [pred_score(y) for y in models]
results = [tr_accuracy, pred_accuracy]
results_df = pd.DataFrame(results, columns = model_names, index=['Training Accuracy', 'Prediction Accuracy'])
results_df
| LogisticRegression | KNeighborsClassifier | DecisionTreeClassifier | SVC | GaussianNB | RandomForestRegressor | |
|---|---|---|---|---|---|---|
| Training Accuracy | 64.324303 | 99.997613 | 100.0 | 91.464375 | 48.501412 | 100.0 |
| Prediction Accuracy | 79.730000 | 100.000000 | 100.0 | 94.310000 | 25.060000 | 100.0 |
The table above shows the accuracy scores obtained with the different supervised learning models. Three of the six models scored 100% on both the training and testing sets, which is a strong sign of overfitting. To avoid overfitting as much as possible, I will choose the model with the middle accuracy score on the testing set, which is Logistic Regression at 79.73%. This is a reasonably good score, even though the other models (except GaussianNB) score higher. Logistic Regression suits this problem because we have a small number of discrete outcomes (BOGO = 1, discount = 2, informational = 3) and a good amount of data to work with.
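One way to sanity-check suspiciously perfect scores is k-fold cross-validation, which averages accuracy over several different train/test splits instead of relying on a single one. A sketch on synthetic data (not the Starbucks transcripts):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic features and three random classes, mimicking the offer types
rng = np.random.RandomState(0)
X = rng.rand(300, 4)
y = rng.randint(1, 4, size=300)

# Five-fold cross-validated accuracy
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(scores.mean())
```

A model that scores 100% on every fold of held-out data still deserves a check for target leakage in the features.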
The result from my Logistic Regression is good enough: we got a fair score while avoiding overfitting. But let's try to improve it a little.
# Tuning to get better accuracy
# Note: dual=True is only supported by the liblinear solver with an L2 penalty
dual = [True, False]
max_iter = [100, 120, 140, 160]
C = [1.0,1.5]
param_grid = dict(dual = dual, max_iter = max_iter, C = C)
lr = LogisticRegression(penalty='l2')
grid = GridSearchCV(estimator = lr, param_grid = param_grid, cv = 3, n_jobs = -1)
grid_result = grid.fit(X_train, y_train)
print(f'Best Score: {grid_result.best_score_}')
print(f'Best params: {grid_result.best_params_}')
Best Score: 0.6468791025182002
Best params: {'C': 1.5, 'dual': False, 'max_iter': 120}
We improved slightly, by 0.36 percentage points. I still think the model is good as it is, and we don't need to push for better results.
Although I believe in the saying "there is always room for improvement", the KNeighborsClassifier model already gives a near-perfect score, and tuning it further would only push it deeper into overfitting. So I will not suggest any improvements to that model; we don't need to chase better numbers there.
In this project, I analyzed the data and built a model to predict the best offer to give a Starbucks customer. First I explored the data to see what had to be changed before starting the analysis. Then, after cleaning, I did some exploratory analysis. From that analysis I found that the most popular offer types are Buy One Get One (BOGO) and discount offers. Digging deeper into who our customers are, I noticed that females tend to complete offers more than males: they completed 56% of the offers they received, while males completed only 43.18% of theirs. Our current data also shows that we gave males more offers, and they made more transactions than females, 72794 versus 49382. In conclusion, the company should send more offers to females, since they complete more of the offers they receive, and should focus on BOGO and discount offers, since those are the ones that tend to make customers buy more.
I think we have reached a point of good results and a solid understanding of the data. To make the results even better, I would try to improve data collection and fix the issues we have with NaN values. I would also try to gather more data, such as location, when each transaction was completed, at which branch, and at what time of day. All of that can help us decide when and where to give our offers, and more data is always a good thing for improving model results.